Toward Quality Data: An Attribute-Based Approach


ABSTRACT The need for a quality perspective in the management of the
data resource is becoming increasingly critical. Managing data quality, however, is a
complex  task.  Although  it  would  be  ideal  to  achieve  zero  defect  data,  this  may  not 
always  be  attainable.  Moreover,  different  users  may  have  different  criteria  in 
determining  the  quality  of  data.  This  suggests  that  it  would  be  useful  to  be  able  to 
tag  data  with  quality  indicators  which  are  characteristics  of  the  data  and  its 
manufacturing  process.  From  these  quality  indicators,  users  can  make  their  own 
judgment  of  the  quality  of  the  data  for  the  specific  application  at  hand. 

This  paper  investigates  how  quality  indicators  may  be  specified,  stored, 
retrieved,  and  processed.  Specifically,  we  propose  an  attribute-based  data  model  that 
facilitates  cell-level  tagging  of  data.  Included  in  this  attribute-based  model  are  a 
mathematical  model  description  that  extends  the  relational  model,  a  set  of  quality 
integrity  rules,  and  a  quality  indicator  algebra  which  can  be  used  to  process  SQL 
queries  that  are  augmented  with  quality  indicator  requirements.  From  these  quality 
indicators,  the  user  can  make  a  better  interpretation  of  the  data  and  determine  the 
believability  of  the  data.  In  order  to  establish  the  relationship  between  data  quality 
dimensions  and  quality  indicators,  a  data  quality  requirements  analysis 
methodology  that  extends  the  Entity  Relationship  model  is  also  presented. 

1.        Introduction 

Organizations  in  industries  such  as  banking,  insurance,  retail,  consumer  marketing,  and  health 
care  are  increasingly  integrating  their  business  processes  across  functional,  product,  and  geographic 
lines.  The  integration  of  these  business  processes,  in  turn,  accelerates  demand  for  more  effective 
application systems for product development, product delivery, and customer service (Rockart & Short,
1989). As a result, many applications today require access to corporate functional and product
databases. Unfortunately, most databases are not error-free, and some contain a surprisingly large
number of errors (Johnson, Leitch, & Neter, 1981). In a recent industry executive report, Computerworld
surveyed 500 medium size corporations (with annual sales of more than $20 million), and reported that
more than 60% of the firms had problems in data quality.¹ The Wall Street Journal also reported that:

Thanks to computers, huge databases brimming with information are at our fingertips, just
waiting to be tapped. They can be mined to find sales prospects among existing customers; they
can be analyzed to unearth costly corporate habits; they can be manipulated to divine future
trends. Just one problem: Those huge databases may be full of junk. ... In a world where people
are moving to total quality management, one of the critical areas is data.²

In  general,  inaccurate,  out-of-date,  or  incomplete  data  can  have  significant  impacts  both 
socially and economically (Laudon, 1986; Liepins & Uppuluri, 1990; Liepins, 1989; Wang & Kon, 1992;
Zarkovich, 1966). Managing data quality, however, is a complex task. Although it would be ideal to
achieve zero defect data,³ this may not always be necessary or attainable for, among others, the
following two reasons:

First, in many applications, it may not always be necessary to attain zero defect data. Mailing
addresses in database marketing is a good example. In sending promotional materials to target
customers, it is not necessary to have the correct city name in an address as long as the zip code is correct.

Second, there is a cost/quality tradeoff in implementing data quality programs. Ballou and
Pazer found that "in an overwhelming majority of cases, the best solutions in terms of error rate
reduction is the worst in terms of cost" (Ballou & Pazer, 1987). The Pareto Principle also suggests that
losses are never uniformly distributed over the quality characteristics. Rather, the losses are always
distributed in such a way that a small percentage of the quality characteristics, "the vital few,"
always contributes a high percentage of the quality loss. As a result, the cost improvement potential is


1 Computerworld, September 28, 1992, p. 80-84.

2 The Wall Street Journal, May 26, 1992, page B6.

3 Just like the well publicized concept of zero defect products in the manufacturing literature.


high for "the vital few" projects whereas the "trivial many" defects are not worth tackling because the
cure costs more than the disease (Juran & Gryna, 1980). In sum, when the cost is prohibitively high, it is
not feasible to attain zero defect data.

Given that zero defect data may not always be necessary or attainable, it would be useful to be
able to judge the quality of data. This suggests that we tag data with quality indicators which are
characteristics of the data and its manufacturing process. From these quality indicators, the user can
make a judgment of the quality of the data for the specific application at hand. In making a financial
decision to purchase stocks, for example, it would be useful to know the quality of data through quality
indicators  such  as  who  originated  the  data,  when  the  data  was  collected,  and  how  the  data  was 
collected. 

In this paper, we propose an attribute-based model that facilitates cell-level tagging of data.
Included  in  this  attribute-based  model  are  a  mathematical  model  description  that  extends  the 
relational  model,  a  set  of  quality  integrity  rules,  and  a  quality  indicator  algebra  which  can  be  used  to 
process  SQL  queries  that  are  augmented  with  quality  indicator  requirements.  From  these  quality 
indicators,  the  user  can  make  a  better  interpretation  of  the  data  and  determine  the  believability  of  the 
data.  In  order  to  establish  the  relationship  between  data  quality  dimensions  and  quality  indicators,  a 
data  quality  requirements  analysis  methodology  that  extends  the  Entity  Relationship  (ER)  model  is 
also  presented. 

Just as it is difficult to manage product quality without understanding the attributes of the
product which define its quality, it is also difficult to manage data quality without understanding the
characteristics that define data quality. Therefore, before one can address issues involved in data
quality, one must define what data quality means. In the following subsection, we present a definition
for the dimensions of data quality.

1.1.  Dimensions of data quality

Accuracy is the most obvious dimension when it comes to data quality. Morey suggested that
"errors occur because of delays in processing times, lengthy correction times, and overly or insufficiently
stringent data edits" (Morey, 1982). In addition to defining accuracy as "the recorded value is in
conformity with the actual value," Ballou and Pazer defined timeliness (the recorded value is not out
of date), completeness (all values for a certain variable are recorded), and consistency (the
representation of the data value is the same in all cases) as the key dimensions of data quality (Ballou
& Pazer, 1987). Huh et al. identified accuracy, completeness, consistency, and currency as the most
important dimensions of data quality (Huh, et al., 1990).

It is interesting to note that although methods for quality control have been well established in
the manufacturing field (e.g., Juran, 1979), neither the dimensions of quality for manufacturing nor for
data have been rigorously defined (Ballou & Pazer, 1985; Garvin, 1983; Garvin, 1987; Garvin, 1988;
Huh, et al., 1990; Juran, 1979; Juran & Gryna, 1980; Morey, 1982; Wang & Guarrascio, 1991). It is also
interesting to note that there are two intrinsic characteristics of data quality:

(1)  Data  quality  is  a  multi-dimensional  concept. 

(2)  Data  quality  is  a  hierarchical  concept. 

We illustrate these two characteristics by considering how a user may make decisions based on
certain data retrieved from a database. First, the user must be able to get to the data, which means that
the data must be accessible (the user has the means and privilege to get the data). Second, the user
must be able to interpret the data (the user understands the syntax and semantics of the data). Third,
the data must be useful (data can be used as an input to the user's decision making process). Finally, the
data must be believable to the user (to the extent that the user can use the data as a decision input).
Resulting from this list are the following four dimensions: accessibility, interpretability, usefulness,
and believability. In order to be accessible to the user, the data must be available (exists in some form
that can be accessed); to be useful, the data must be relevant (fits requirements for making the decision);
and to be believable, the user may consider, among other factors, that the data be complete, timely,
consistent, credible, and accurate. Timeliness, in turn, can be characterized by currency (when the data
item was stored in the database) and volatility (how long the item remains valid). Figure 1 depicts
the data quality dimensions illustrated in this scenario.


Figure 1: A Hierarchy of Data Quality Dimensions


These multi-dimensional concepts and hierarchy of data quality dimensions provide a
conceptual framework for understanding the characteristics that define data quality. In this paper, we
focus on interpretability and believability, as we consider accessibility to be primarily a function of the
information system and usefulness to be primarily a function of an interaction between the data and the
application domain. The idea of data tagging is illustrated more concretely below.
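The hierarchy described in this subsection can be written down as a nested mapping. The following is a minimal sketch in Python and is not part of the proposed model; the noun forms `availability` and `relevance`, the constant `HIERARCHY`, and the helper `path_to` are assumptions introduced here for illustration only.

```python
from typing import Optional, Tuple

# Each dimension maps to the lower-level dimensions that characterize it.
# The entries follow the scenario in the text; the structure is illustrative.
HIERARCHY = {
    "accessibility": {"availability": {}},
    "interpretability": {},
    "usefulness": {"relevance": {}},
    "believability": {
        "completeness": {},
        "timeliness": {"currency": {}, "volatility": {}},
        "consistency": {},
        "credibility": {},
        "accuracy": {},
    },
}


def path_to(dimension: str, tree: dict = HIERARCHY,
            prefix: Tuple[str, ...] = ()) -> Optional[Tuple[str, ...]]:
    """Return the path from a top-level dimension down to `dimension`."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == dimension:
            return here
        found = path_to(dimension, children, here)
        if found is not None:
            return found
    return None  # the dimension does not appear in the hierarchy
```

Looking up `currency`, for instance, yields the path through `believability` and `timeliness`, which reflects both the multi-dimensional and the hierarchical character of data quality.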

1.2.  Data quality: an attribute-based example

Suppose  an  analyst  maintains  a  database  on  technology  companies.  The  schema  used  to  support 
this  effort  may  contain  attributes  such  as  company  name,  CEO  name,  and  earnings  estimate  (Table  1). 
Data may be collected over a period of time and come from a variety of sources.

Table 1: Company Information

Company Name    CEO Name    Earnings Estimate
IBM             Akers       7
DELL            Dell        3

As  part  of  determining  the  believability  of  the  data  (assuming  high  interpretability),  the 
analyst  may  want  to  know  when  the  data  was  generated,  where  it  came  from,  how  it  was  originally 
obtained,  and  by  what  means  it  was  recorded  into  the  database.  From  Table  1,  the  analyst  would  have 
no  means  of  obtaining  this  information.  We  illustrate  in  Table  2  an  approach  in  which  the  data  is 
tagged  with  quality  indicators  which  may  help  the  analyst  determine  the  believability  of  the  data. 

Table 2: Company information with quality indicators

Company Name    CEO Name    Earnings Estimate
IBM             Akers       7 <source: Barron's, reporting_date: 10-05-92, data_entry_operator: Joe>
DELL            Dell        3 <source: WSJ, reporting_date: 10-06-92, data_entry_operator: Mary>

As shown in Table 2, "7, <source: Barron's, reporting_date: 10-05-92, data_entry_operator: Joe>"
in Column 3 indicates that "$7 was the Earnings Estimate of IBM" was reported by Barron's on
October 5, 1992 and was entered by Joe. An experienced analyst would know that Barron's is a credible
source; that October 5, 1992 is timely (assuming that October 5 was recent); and that Joe is experienced,
therefore the data is likely to be accurate. As a result, he may conclude that the earnings estimate is
believable. This example both illustrates the need for, and provides an example approach for,
incorporating quality indicators into the database through data tagging.
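To make the tagging idea concrete, the following is a minimal sketch in Python of the data in Table 2, with each cell carrying its own quality indicators. It is an illustration only, not the data model developed later in the paper; the names `TaggedCell`, `company`, and `describe` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class TaggedCell:
    """A cell value together with the quality indicators tagged to it."""
    value: Any
    indicators: Dict[str, str] = field(default_factory=dict)


# The rows of Table 2, keyed by company name. Only the earnings estimate
# carries quality indicators; the CEO name is left untagged.
company = {
    "IBM": {
        "ceo_name": TaggedCell("Akers"),
        "earnings_estimate": TaggedCell(7, {
            "source": "Barron's",
            "reporting_date": "10-05-92",
            "data_entry_operator": "Joe",
        }),
    },
    "DELL": {
        "ceo_name": TaggedCell("Dell"),
        "earnings_estimate": TaggedCell(3, {
            "source": "WSJ",
            "reporting_date": "10-06-92",
            "data_entry_operator": "Mary",
        }),
    },
}


def describe(cell: TaggedCell) -> str:
    """Render a cell in the 'value <tag: value, ...>' notation of Table 2."""
    if not cell.indicators:
        return str(cell.value)
    tags = ", ".join(f"{name}: {val}" for name, val in cell.indicators.items())
    return f"{cell.value} <{tags}>"
```

An analyst can then read the indicators for any single cell, for example `company["DELL"]["earnings_estimate"].indicators["source"]`, without assuming that the rest of the row shares them.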

1.3.  Research focus and paper organization

The goal of the attribute-based approach is to facilitate the collection, storage, retrieval, and
processing of data that has quality indicators. Central to the approach is the notion that an attribute
value may have a set of quality indicators associated with it. In some applications, it may be
necessary to know the quality of the quality indicators themselves, in which case a quality indicator
may, in turn, have another set of associated quality indicators. As such, an attribute may have an
arbitrary number of underlying levels of quality indicators. This constitutes a tree structure, as shown in
Figure 2 below.

Figure 2: An attribute with quality indicators
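The tree structure can be sketched as a recursive type in which every value, whether an attribute value or an indicator value, may carry indicators of its own. The Python below is a minimal illustration under that assumption; the names `Node` and `depth`, the second-level indicator `analyst_name`, and its placeholder value are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """A value plus the quality indicators attached directly to it."""
    value: object
    indicators: Dict[str, "Node"] = field(default_factory=dict)


def depth(node: Node) -> int:
    """Number of indicator levels beneath a node (0 if it is untagged)."""
    if not node.indicators:
        return 0
    return 1 + max(depth(child) for child in node.indicators.values())


# An earnings estimate whose source indicator is itself tagged.
# The analyst value is a placeholder, not data from the paper.
estimate = Node(7, {
    "source": Node("Barron's", {
        "analyst_name": Node("(hypothetical analyst)"),
    }),
    "data_entry_operator": Node("Joe"),
})
```

Here `depth(estimate)` is 2, since the source indicator has one further level of indicators beneath it, while an untagged value has depth 0.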

Conventional  spreadsheet  programs  and  database  systems  are  not  appropriate  for  handling 
data  which  is  structured  in  this  manner.  In  particular,  they  lack  the  quality  integrity  constraints 
necessary for ensuring that quality indicators are always tagged along with the data (and deleted
when the data is deleted) and the algebraic operators necessary for attribute-based query processing.
In  order  to  associate  an  attribute  with  its  immediate  quality  indicators,  a  mechanism  must  be 
developed  to  facilitate  the  linkage  between  the  two,  as  well  as  between  a  quality  indicator  and  the  set 
of  quality  indicators  associated  with  it. 

This paper is organized as follows. Section 2 presents the research background. Section 3
presents the data quality requirements analysis methodology. In Section 4, we present the
attribute-based data model. Discussion and future directions are presented in Section 5.

2.       Research  background 

In  this  section  we  discuss  our  rationale  for  tagging  data  at  the  cell  level,  summarize  the 
literature  related  to  data  tagging,  and  present  the  terminology  used  in  this  paper. 

2.1.  Rationale for cell-level tagging

Any characteristics of data at the relation level should be applicable to all instances of the
relation. It is, however, not reasonable to assume that all instances (i.e., tuples) of a relation have the
same  quality.  Therefore,  tagging  quality  indicators  at  the  relation  level  is  not  sufficient  to  handle 
quality  heterogeneity  at  the  instance  level. 


By the same token, any characteristics of data tagged at the tuple level should be applicable
to all attribute values in the tuple. However, each attribute value in a tuple may be collected from
different sources, through different collection methods, and updated at different points in time.
Therefore, tagging data at the tuple level is also insufficient. Since the attribute value of a cell is the
basic unit of manipulation, it is necessary to tag quality information at the cell level.

We  now  examine  the  literature  related  to  data  tagging. 

2.2.  Work related to data tagging

A mechanism for tagging data has been proposed by Codd. It includes NOTE, TAG, and
DENOTE  operations  to  tag  and  un-tag  the  name  of  a  relation  to  each  tuple.  The  purpose  of  these 
operators  is  to  permit  both  the  schema  information  and  the  database  extension  to  be  manipulated  in  a 
uniform  way  (Codd,  1979).  It  does  not,  however,  allow  for  the  tagging  of  other  data  (such  as  source)  at 
either  the  tuple  or  cell  level. 

Although  self-describing  data  files  and  meta-data  management  have  been  proposed  at  the 
schema  level  (McCarthy,  1982;  McCarthy,  1984;  McCarthy,  1988),  no  specific  solution  has  been  offered 
to manipulate such quality information at the tuple and cell levels.

A  rule-based  representation  language  based  on  a  relational  schema  has  been  proposed  to  store 
data semantics at the instance level (Siegel & Madnick, 1991). These rules are used to derive
meta-attribute values based on values of other attributes in the tuple. However, these rules are specified at
the  tuple  level  as  opposed  to  the  cell  level,  and  thus  cell-level  operations  are  not  inherent  in  the 
model. 

A polygen model (poly = multiple, gen = source) (Wang & Madnick, 1990) has been proposed to
tag  multiple  data  sources  at  the  cell  level  in  a  heterogeneous  database  environment  where  it  is 
important  to  know  not  only  the  originating  data  source  but  also  the  intermediate  data  sources  which 
contribute  to  final  query  results.  The  research,  however,  focused  on  the  "where  from"  perspective  and 
did  not  provide  mechanisms  to  deal  with  more  general  quality  indicators. 

In (Sciore, 1991), annotations are used to support the temporal dimension of data in an
object-oriented environment. However, data quality is a multi-dimensional concept. Therefore, a more
general  treatment  is  necessary  to  address  the  data  quality  issue.  More  importantly,  no  algebra  or 
calculus-based  language  is  provided  to  support  the  manipulation  of  annotations  associated  with  the 
data. 


The  examination  of  the  above  research  efforts  suggests  that  in  order  to  support  the 
functionality of our attribute-based model, an extension of existing data models is required.

2.3.  Terminology

To  facilitate  further  discussion,  we  introduce  the  following  terms: 

•  An application attribute refers to an attribute associated with an entity or a relationship in an
entity-relationship (ER) diagram. This would include the data traditionally associated with
an application such as part number and supplier.

•  A  quality  parameter  is  a  qualitative  or  subjective  dimension  of  data  quality  that  a  user  of  data 
defines  when  evaluating  data  quality.  For  example,  believability  and  timeliness  are  such 
dimensions. 

•  As  introduced  in  Section  1,  quality  indicators  provide  objective  information  about  the 
characteristics of data and its manufacturing process.⁴ Data source, creation time, and
collection  method  are  examples  of  such  objective  measures. 

•  A  quality  parameter  value  is  the  value  determined  (directly  or  indirectly)  by  the  user  of  data 
for  a  particular  quality  parameter  based  on  underlying  quality  indicators.  Functions  can  be 
defined  by  users  to  map  quality  indicators  to  quality  parameters.  For  example,  the  quality 
parameter  credibility  may  be  defined  as  high  or  low  depending  on  the  quality  indicator  source 
of  the  data. 

•  A  quality  indicator  value  is  a  measured  characteristic  of  the  stored  data.  For  example,  the 
data quality indicator source may have a quality indicator value The Wall Street Journal.
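The relationship between indicators and parameters in the definitions above can be sketched as a user-defined function from quality indicator values to a quality parameter value. The Python below is a minimal illustration of the credibility example; the function name `credibility` and the particular set of sources treated as credible are assumptions that a given user would choose, not part of the model.

```python
from typing import Dict

# Assumption: this particular user regards these sources as credible.
CREDIBLE_SOURCES = {"Barron's", "The Wall Street Journal"}


def credibility(indicators: Dict[str, str]) -> str:
    """Map the quality indicator 'source' to the quality parameter
    'credibility', returning the parameter value 'high' or 'low'."""
    source = indicators.get("source")
    return "high" if source in CREDIBLE_SOURCES else "low"
```

Two users may apply different functions to the same stored indicator values, which is why the indicators, rather than the parameter values, are what is tagged to the data.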

We  have  discussed  the  rationale  for  cell-level  tagging,  summarized  work  related  to  data 
tagging,  and  introduced  the  terminology  used  in  this  paper.  In  the  next  section,  we  present  a 
methodology  for  the  specification  of  data  quality  parameters  and  indicators.  The  intent  is  to  allow 
users  to  think  through  their  data  quality  requirements,  and  to  determine  which  quality  indicators 
would  be  appropriate  for  a  given  application. 


4 We consider an indicator objective if it is generated using a well defined and widely accepted measure.


3.        Data  quality  requirements  analysis 

In general, different users may have different data quality requirements, and different types of
data  may  have  different  quality  characteristics.  The  reader  is  referred  to  Appendix  A  for  a  more 
thorough  treatment  of  these  issues. 

Data quality requirements analysis is an effort similar in spirit to traditional data
requirements analysis (Batini, Lenzerini, & Navathe, 1986; Navathe, Batini, & Ceri, 1992; Teorey,
1990), but focusing on quality aspects of the data. Based on this similarity, parallels can be drawn
between traditional data requirements analysis and data quality requirements analysis. Figure 3
depicts the steps involved in performing the proposed data quality requirements analysis.

Figure 3: The process of data quality requirements analysis

The input, output, and objective of each step are described in the following subsections.


3.1.  Step 1: Establishing the application view

Step 1 is the whole of the traditional data modeling process and will not be elaborated upon in
this paper. A comprehensive treatment of the subject has been presented elsewhere (Batini, Lenzerini,
& Navathe, 1986; Navathe, Batini, & Ceri, 1992; Teorey, 1990).

For illustrative purposes, suppose that we are interested in designing a portfolio management
system which contains companies that issue stocks. A company has a company name, a CEO, and an
earnings estimate, while a stock has a share price, a stock exchange (NYSE, AMS, or OTC), and a ticker
symbol. An ER diagram that documents the application view for our running example is shown below in
Figure 4.

Figure 4: Application view (output from Step 1)

3.2.  Step 2: Determine (subjective) quality parameters

The goal in this step is to elicit quality parameters from the user given an application view.
These parameters need to be gathered from the user in a systematic way as data quality is a
multi-dimensional concept, and may be operationalized for tagging purposes in different ways. Figure 5
illustrates the addition of the two high level parameters, interpretability and believability, to the
application view. Each quality parameter identified is shown inside a "cloud" in the diagram.


Figure  5:  Interpretability  and  believability  added  to  the  application  view 


Interpretability can be defined through quality indicators such as data units (e.g., in dollars)
and scale (e.g., in millions). Believability can be defined in terms of lower-level quality parameters
such  as  completeness,  timeliness,  consistency,  credibility,  and  accuracy.  Timeliness,  in  turn,  can  be 
defined  through  currency  and  volatility.  The  quality  parameters  identified  in  this  step  are  added  to 
the  application  view.  The  resulting  view  is  referred  to  as  the  parameter  view.  We  focus  here  on  the 
stock  entity  which  is  shown  in  Figure  6. 

Figure 6: Parameter view for the stock entity (partial output from Step 2)


3.3.  Step 3: Determine (objective) quality indicators

The goal in Step 3 is to operationalize the primarily subjective quality parameters identified
in Step 2 into objective quality indicators. Each quality indicator is depicted as a tag (using a
dotted-rectangle) and is attached to the corresponding quality parameter (from Step 2), creating the quality
view. The portion of the quality view for the stock entity in the running example is shown in Figure 7.

Figure 7: The portion of the quality view for the stock entity (output from Step 3)

Corresponding to the quality parameter interpretable are the more objective quality indicators
currency units in which share price is measured (e.g., $ vs. ¥) and status, which says whether the share
price is the latest closing price or latest nominal price. Similarly, the believability of the share price
is indicated by the quality indicators source and reporting date.

For  each  quality  indicator  identified  in  a  quality  view,  if  it  is  important  to  have  quality 
indicators  for  a  quality  indicator,  then  Steps  2-3  are  repeated,  making  this  an  iterative  process.  For 
example, the quality of the attribute Earnings Estimate may depend not only on the first level source
(i.e., the name of the journal) but also on the second level source (i.e., the name of the financial analyst
who provided the Earnings Estimate figure to the journal and the Reporting date). This scenario is
depicted below in Figure 8.


Figure 8: Quality indicators of quality indicators

All quality views are integrated in Step 4 to generate the quality schema, as discussed in the
following  subsection. 

3.4.  Step 4: Creating the quality schema

When the design is large and more than one set of application requirements is involved,
multiple quality views may result. To eliminate redundancy and inconsistency, these quality views
must be consolidated into a single global view, in a process similar to schema integration (Batini,
Lenzerini, & Navathe, 1986), so that a variety of data quality requirements can be met. The resulting
single  global  view  is  called  the  quality  schema. 

This  involves  the  integration  of  quality  indicators.  In  simpler  cases,  a  union  of  these  indicators 
may  suffice.  In  more  complicated  cases,  it  may  be  necessary  to  examine  the  relationships  among  the 
indicators  in  order  to  decide  what  indicators  to  include  in  the  quality  schema.  For  example,  it  is  likely 
that one quality view may have age as an indicator, whereas another quality view may have creation
time for the same quality parameter. In this case, creation time may be chosen for the quality schema
because  age  can  be  computed  given  current  time  and  creation  time. 
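The reason for preferring creation time is that age is a derived quantity. The following minimal Python sketch shows the derivation; the function name `age` and the choice to reject a creation time that lies in the future are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def age(creation_time: datetime, current_time: datetime) -> timedelta:
    """Derive the age indicator from the stored creation time."""
    if current_time < creation_time:
        raise ValueError("creation time lies after the current time")
    return current_time - creation_time
```

Storing creation time therefore satisfies both quality views: one reads the stored value directly and the other computes age on demand, whereas a stored age would become stale as time passes.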

We  have  presented  a  step-by-step  procedure  to  specify  data  quality  requirements.  We  are  now 
in a position to present the attribute-based data model for supporting the storage, retrieval, and
processing  of  quality  indicators  as  specified  in  the  quality  schema. 

4.        The  attribute-based  model  of  data  quality 

We choose to extend the relational model because the structure and semantics of the relational
approach are widely understood. Following the relational model (Codd, 1982), the presentation of the
attribute-based data model is divided into three parts: (a) data structure, (b) data integrity, and (c)
data  manipulation.  We  assume  that  the  reader  is  familiar  with  the  relational  model  (Codd,  1970; 
Codd,  1979;  Date,  1990;  Maier,  1983). 

4.1.  Data structure

As  shown  in  Figure  2  (Section  1),  an  attribute  may  have  an  arbitrary  number  of  underlying 
levels  of  quality  indicators.  In  order  to  associate  an  attribute  with  its  immediate  quality  indicators,  a 
mechanism  must  be  developed  to  facilitate  the  linkage  between  the  two,  as  well  as  between  a  quality 
indicator  and  the  set  of  quality  indicators  associated  with  it.  This  mechanism  is  developed  through 
the  quality  key  concept.  In  extending  the  relational  model,  Codd  made  clear  the  need  to  uniquely 
identify tuples through a system-wide unique identifier, called the tuple ID (Codd, 1979; Khoshafian
& Copeland, 1990).⁵ This concept is applied in the attribute-based model to enable this linkage.
Specifically, an attribute in a relation scheme is expanded into an ordered pair, called a quality
attribute, consisting of the attribute and a quality key.

For example, the attribute Earnings Estimate (EE) in Table 3 is expanded into <EE, EE¢> in Table
4, where EE¢ is the quality key for the attribute EE (Tables 3-6 are embedded in Figure 9). This
expanded scheme is referred to as a quality scheme. In Table 4, (<CN, nil¢>, <CEO, nil¢>, <EE, EE¢>)
defines a quality scheme for the quality relation Company. The "nil¢" indicates that no quality
indicators are associated with the attributes CN and CEO, whereas EE¢ indicates that EE has
associated quality indicators.

Correspondingly,  each  cell  in  a  relational  tuple  is  expanded  into  an  ordered  pair,  called  a 
quality  cell,  consisting  of  an  attribute  value  and  a  quality  key  value.  This  expanded  tuple  is  referred  to 


5 Similarly, in the object-oriented literature, the ability to make references through object identity is considered a basic
property of an object-oriented data model.
as a quality tuple and the resulting relation (Table 4) is referred to as a quality relation. Each quality
key value in a quality cell refers to the set of quality indicator values immediately associated with
the attribute value. This set of quality indicator values is grouped together to form a kind of quality
tuple called a quality indicator tuple. A quality relation composed of a set of these time-varying
quality indicator tuples is called a quality indicator relation. The quality scheme that defines the
quality indicator relation is referred to as the quality indicator scheme.

Under the relational model

Table 3: Relation for Company

tid       Company Name (CN)    CEO Name (CEO)    Earnings Estimate (EE)
id001¢    IBM                  Akers             7
id002¢    DELL                 Dell              3

Under the attribute-based model

Table 5: Level-One QIR for the EE attribute

Table 6: Level-Two QIR for the EE attribute

Figure 9: The Quality Scheme Set for Company

The quality key thus serves as a foreign key, relating an attribute (or quality indicator) value
to its associated quality indicator tuple. For example, Table 5 is a quality indicator relation for the
attribute Earnings Estimate and Table 6 is a quality indicator relation for the attribute SRC1 (source of
data) in Table 5. The quality cell <Wall St Jnl, id202¢> in Table 5 contains a quality key value, id202¢,
which is a tuple id (primary key) in Table 6.

Let qr1 be a quality relation and a an attribute in qr1. If a has associated quality indicators,
then its quality key must be non-null (i.e., not "nil¢"). Let qr2 be the quality indicator relation
containing a quality indicator tuple for a; then all the attributes of qr2 are called level-one quality
indicators for a. Each attribute in qr2, in turn, can have a quality indicator relation associated with it.
In general, an attribute can have n levels of quality indicator relations associated with it, n ≥ 0. For
example, Tables 5-6 are referred to respectively as level-one and level-two quality indicator relations
for the attribute Earnings Estimate.
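The structure described so far can be sketched in code. The following Python fragment is only an illustration under stated assumptions, not the authors' implementation: the names QualityCell, indicator_tuple, and indicator_depth are hypothetical, and the sample rows are assumed apart from the two cells quoted in the text (<7, id101¢> and <Wall St Jnl, id202¢>). In particular, pairing that cell with IBM and the SRC2 value 'Zacks' are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

NIL = None  # stands for the null quality key nil¢


@dataclass(frozen=True)
class QualityCell:
    """An ordered pair <attribute value, quality key value>."""
    value: object
    key: Optional[str] = NIL


# One quality tuple of the quality relation Company (row contents assumed).
company = [
    {"CN": QualityCell("IBM"), "CEO": QualityCell("Akers"),
     "EE": QualityCell(7, "id101")},
]

# Each quality indicator relation maps a tuple id (primary key) to a
# quality indicator tuple, which is itself a dict of quality cells.
level_one = {"id101": {"SRC1": QualityCell("Wall St Jnl", "id202")}}
level_two = {"id202": {"SRC2": QualityCell("Zacks")}}  # value assumed
indicator_relations = [level_one, level_two]


def indicator_tuple(key, relations):
    """Follow a quality key (a foreign key) to its quality indicator tuple."""
    for relation in relations:
        if key in relation:
            return relation[key]
    raise KeyError(key)


def indicator_depth(cell, relations):
    """Number of quality indicator levels n (n >= 0) below a quality cell."""
    if cell.key is NIL:
        return 0
    children = indicator_tuple(cell.key, relations).values()
    return 1 + max((indicator_depth(c, relations) for c in children), default=0)


row = company[0]
print(indicator_depth(row["CN"], indicator_relations))  # 0: CN carries nil¢
print(indicator_depth(row["EE"], indicator_relations))  # 2: level-one and level-two
```

Keying every indicator relation by tuple id mirrors the role of the quality key as a foreign key. A relational implementation would store the same pairs as columns.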

We  define  a  quality  scheme  set  as  the  collection  of  a  quality  scheme  and  all  the  quality 
indicator  schemes  that  are  associated  with  it.  In  Figure  9,  Tables  3-6  collectively  define  the  quality 
scheme  set  for  Company.  We  define  a  quality  database  as  a  database  that  stores  not  only  data  but  also 
quality  indicators.  A  quality  schema  is  defined  as  a  set  of  quality  scheme  sets  that  describes  the 
structure  of  a  quality  database.  Figure  10  illustrates  the  relationship  among  quality  schemes,  quality 
indicator  schemes,  quality  scheme  sets,  and  the  quality  schema. 


Figure 10: Quality schemes, quality indicator schemes, quality scheme sets, and the quality schema

We now present a mathematical definition of the quality relation. Following the constructs
developed in the relational model, we define a domain as a set of values of similar type. Let ID be the
domain for a system-wide unique identifier (in Table 4, id101¢ ∈ ID). Let D be a domain for an attribute
(in Table 4, 7 ∈ EE where EE is a domain for earnings estimate). Let DID be defined on the Cartesian
product D × ID (in Table 4, <7, id101¢> ∈ DID).

Let id be a quality key value associated with an attribute value d, where d ∈ D and id ∈ ID. A
quality relation (qr) of degree m is defined on the m+1 domains (m > 0; in Table 4, m = 3) if it is a subset of
the Cartesian product:

$$ID \times DID_1 \times DID_2 \times \cdots \times DID_m$$

Let qt be a quality tuple, which is an element in a quality relation. Then a quality relation qr
is designated as:

$$qr = \{\, qt \mid qt = \langle id, did_1, did_2, \ldots, did_m \rangle \text{ where } id \in ID,\ did_j \in DID_j,\ j = 1, \ldots, m \,\}$$

The integrity constraints for the attribute-based model are presented next.

4.2. Data integrity

A fundamental property of the attribute-based model is that an attribute value and its
corresponding quality (including all descendant) indicator values are treated as an atomic unit. By
atomic unit we mean that whenever an attribute value is created, deleted, retrieved, or modified, its
corresponding quality indicators also need to be created, deleted, retrieved, or modified respectively.
In other words, an attribute value and its corresponding quality indicator values behave atomically.
We refer to this property as the atomicity property hereafter. This property is enforced by a set of
quality referential integrity rules as defined below.

Insertion: Insertion of a tuple in a quality relation must ensure that for each non-null quality
key  present  in  the  tuple  (as  specified  in  the  quality  schema  definition),  the  corresponding  quality 
indicator  tuple  must  be  inserted  into  the  child  quality  indicator  relation.  For  each  non-null  quality  key 
in  the  inserted  quality  indicator  tuple,  a  corresponding  quality  indicator  tuple  must  be  inserted  at  the 
next  level.  This  process  must  be  continued  recursively  until  no  more  insertions  are  required. 

Deletion:  Deletion  of  a  tuple  in  a  quality  relation  must  ensure  that  for  each  non-null  quality 
key  present  in  the  tuple,  corresponding  quality  information  must  be  deleted  from  the  table 
corresponding  to  the  quality  key.  This  process  must  be  continued  recursively  until  a  tuple  is  encountered 
with  all  null  quality  keys. 

Modification:  If  an  attribute  value  is  modified  in  a  quality  relation,  then  the  descendant 
quality  indicator  values  of  that  attribute  must  be  modified. 
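The insertion and deletion rules amount to a recursive cascade along the quality keys. The sketch below is a minimal illustration under stated assumptions: the in-memory layout, the relation names Company, EE_level1, and EE_level2, the tuple ids t1 and t2, and the function names are all hypothetical, and the sample row reuses the IBM values that appear later in Figure 13. The modification rule is not covered.

```python
import copy

NIL = None  # stands for nil¢

# (relation, attribute) -> child quality indicator relation (names hypothetical)
CHILD_OF = {("Company", "EE"): "EE_level1", ("EE_level1", "SRC1"): "EE_level2"}


def _insert(db, relation, tid, row, pending):
    """Insert row, then recursively insert the indicator tuple of every non-null key."""
    db.setdefault(relation, {})[tid] = row
    for attr, (_value, key) in row.items():
        if key is not NIL:
            # A KeyError here means a required indicator tuple was not supplied.
            _insert(db, CHILD_OF[(relation, attr)], key, pending[key], pending)


def insert_tuple(db, relation, tid, row, pending):
    """Apply the insertion rule; a failed insertion leaves db untouched."""
    staged = copy.deepcopy(db)
    _insert(staged, relation, tid, row, pending)
    db.clear()
    db.update(staged)


def delete_tuple(db, relation, tid):
    """Apply the deletion rule: remove the tuple and all descendant indicator tuples."""
    row = db[relation].pop(tid)
    for attr, (_value, key) in row.items():
        if key is not NIL:
            delete_tuple(db, CHILD_OF[(relation, attr)], key)


db = {}
pending = {
    "id0101": {"SRC1": ("Nexis", "id0201"), "Reporting_date": ("10-07-92", NIL)},
    "id0201": {"SRC2": ("Zacks", NIL), "Reporting_date": ("1-07-92", NIL)},
}
ibm = {"CN": ("IBM", NIL), "CEO": ("J Akers", NIL), "EE": (6.08, "id0101")}
insert_tuple(db, "Company", "t1", ibm, pending)
print(sorted(db))  # ['Company', 'EE_level1', 'EE_level2']

try:
    insert_tuple(db, "Company", "t2", {"EE": (2.51, "id0103")}, {})
except KeyError:
    print("rejected: no quality indicator tuple supplied for id0103")

delete_tuple(db, "Company", "t1")
print(db)  # all three relations are empty again
```

Staging the insertion on a copy is one simple way to respect the atomicity property. A database system would more likely rely on transactions.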

We now introduce a quality indicator algebra for the attribute-based model.

4.3. Data manipulation

In order to present the algebra formally, we first define two key concepts that are fundamental
to the quality indicator algebra: QI-Compatibility and QIV-Equal.

4.3.1.     QI-Compatibility and QIV-Equal

Let $a_1$ and $a_2$ be two application attributes. Let $QI(a_i)$ denote the set of quality indicators
associated with $a_i$. Let $S$ be a set of quality indicators. If $S \subseteq QI(a_1)$ and $S \subseteq QI(a_2)$, then $a_1$ and $a_2$
are defined to be QI-Compatible with respect to $S$.6 For example, if $S = \{qi_1, qi_2, qi_{21}\}$, then the
attributes $a_1$ and $a_2$ shown in Figure 11 are QI-Compatible with respect to $S$, whereas if $S = \{qi_1, qi_{22}\}$,
then the attributes $a_1$ and $a_2$ shown in Figure 11 are not QI-Compatible with respect to $S$.

6 We assume that the numeric subscripts (e.g., $qi_{11}$) map the quality indicators to unique positions in the
quality indicator tree.

Figure 11: QI-Compatibility Example

Let $a_1$ and $a_2$ be QI-Compatible with respect to $S$. Let $w_1$ and $w_2$ be values of $a_1$ and $a_2$
respectively. Let $qi(w_1)$ be the value of quality indicator $qi$ for the attribute value $w_1$, where $qi \in S$
($qi_2(w_1) = v_2$ in Figure 12). Define $w_1$ and $w_2$ to be QIV-Equal with respect to $S$ provided that
$qi(w_1) = qi(w_2)\ \forall\, qi \in S$, denoted as $w_1 =^{S} w_2$. In Figure 12, for example, $w_1$ and $w_2$ are QIV-Equal
with respect to $S = \{qi_1, qi_{21}\}$, but not QIV-Equal with respect to $S = \{qi_1, qi_{31}\}$ because
$qi_{31}(w_1) = v_{31}$, whereas $qi_{31}(w_2) = x_{31}$.


Figure 12: QIV-Equal Example

In practice, it is tedious to explicitly state all the quality indicators to be compared (i.e., to
specify all the elements of S). To alleviate the situation, we introduce i-level QI-Compatibility
(i-level QIV-Equal) as a special case of QI-Compatibility (QIV-Equal) in which all the quality
indicators up to a certain level of depth in a quality indicator tree are considered.

Let $a_1$ and $a_2$ be two application attributes. Let $a_1$ and $a_2$ be QI-Compatible with respect to $S$.
Let $w_1$ and $w_2$ be values of $a_1$ and $a_2$ respectively; then $w_1$ and $w_2$ are defined to be i-level
QI-Compatible if the following two conditions are satisfied: (1) $a_1$ and $a_2$ are QI-Compatible with respect
to $S$, and (2) $S$ consists of all quality indicators present within $i$ levels of the quality indicator tree of
$a_1$ (thus of $a_2$).

By the same token, i-level QIV-Equal between $w_1$ and $w_2$, denoted by $w_1 =^{i} w_2$, can be defined.

If $i$ is the maximum level of depth in the quality indicator tree, then $a_1$ and $a_2$ are defined to
be maximum-level QI-Compatible. Similarly, maximum-level QIV-Equal between $w_1$ and $w_2$, denoted
by $w_1 =^{m} w_2$, can also be defined.
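The two definitions translate directly into set and dictionary operations. The sketch below is illustrative only. The function names are hypothetical, the sample values are loosely modeled on Figure 12 (only $qi_2(w_1) = v_2$, $qi_{31}(w_1) = v_{31}$, and $qi_{31}(w_2) = x_{31}$ are stated in the text), and the level of an indicator is assumed to equal the number of digits in its subscript, which holds only when every node has fewer than ten children.

```python
def qi_compatible(qi_a1, qi_a2, s):
    """a1 and a2 are QI-Compatible w.r.t. S when S is a subset of QI(a1) and of QI(a2)."""
    return s <= qi_a1 and s <= qi_a2


def qiv_equal(w1, w2, s):
    """w1 and w2 are QIV-Equal w.r.t. S when qi(w1) == qi(w2) for every qi in S."""
    return all(w1[qi] == w2[qi] for qi in s)


def within_levels(qi_set, i):
    """Indicators within i levels of the tree (assumed: level == subscript length)."""
    return {qi for qi in qi_set if len(qi) - len("qi") <= i}


def i_level_qiv_equal(w1, w2, i):
    """i-level QIV-Equal: S is every indicator within i levels of both trees."""
    s = within_levels(set(w1), i)
    return s == within_levels(set(w2), i) and qiv_equal(w1, w2, s)


# Indicator name -> indicator value for two attribute values (assumed sample data).
w1 = {"qi1": "v1", "qi2": "v2", "qi3": "v3", "qi21": "v21", "qi31": "v31"}
w2 = {"qi1": "v1", "qi2": "v2", "qi3": "v3", "qi21": "v21", "qi31": "x31"}

print(qi_compatible(set(w1), set(w2), {"qi1", "qi21"}))  # True
print(qiv_equal(w1, w2, {"qi1", "qi21"}))                # True
print(qiv_equal(w1, w2, {"qi1", "qi31"}))                # False: v31 differs from x31
print(i_level_qiv_equal(w1, w2, 1))                      # True: level-one values agree
print(i_level_qiv_equal(w1, w2, 2))                      # False: qi31 differs
```

In this sample the tree depth is two, so the last call also corresponds to maximum-level QIV-Equal.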
To exemplify the algebraic operations in the quality indicator algebra, we introduce two
quality relations having the same quality scheme set as shown in Figure 9. They are referred to as
Large_and_Medium (Tables 7, 7.1, and 7.2 in Figure 13) and Small_and_Medium (Tables 8, 8.1, and 8.2 in
Figure 14).


Table 7

<CN, nil¢>       <CEO, nil¢>           <EE, EE¢>
<IBM, nil¢>      <J Akers, nil¢>       <6.08, id0101¢>
<DEC, nil¢>      <K Olsen, nil¢>       <-0.32, id0102¢>
<TI, nil¢>       <J Junkins, nil¢>     <2.51, id0103¢>

Table 7.1

<EE¢, nil¢>      <SRC1, SRC1¢>         <Reporting_date, nil¢>
<id0101¢, nil¢>  <Nexis, id0201¢>      <10-07-92, nil¢>
<id0102¢, nil¢>  <Nexis, id0202¢>      <10-07-92, nil¢>
<id0103¢, nil¢>  <Lotus, id0203¢>      <10-07-92, nil¢>

Table 7.2

<SRC1¢, nil¢>    <SRC2, nil¢>          <Reporting_date, nil¢>
<id0201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
<id0202¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>
<id0203¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>

Figure 13: The Quality Relation Large_and_Medium


Table 8

<CN, nil¢>       <CEO, nil¢>           <EE, EE¢>
<Apple, nil¢>    <J Sculley, nil¢>     <5.69, id1101¢>
<DEC, nil¢>      <K Olsen, nil¢>       <-0.32, id1102¢>
<TI, nil¢>       <J Junkins, nil¢>     <2.51, id1103¢>

Table 8.1

<EE¢, nil¢>      <SRC1, SRC1¢>         <Reporting_date, nil¢>
<id1101¢, nil¢>  <Lotus, id1201¢>      <10-07-92, nil¢>
<id1102¢, nil¢>  <Nexis, id1202¢>      <10-07-92, nil¢>
<id1103¢, nil¢>  <Lotus, id1203¢>      <10-07-92, nil¢>

Table 8.2

<SRC1¢, nil¢>    <SRC2, nil¢>          <Reporting_date, nil¢>
<id1201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
<id1202¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>
<id1203¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>

Figure 14: The Quality Relation Small_and_Medium


These two quality relations will be used to illustrate various operations of the quality
indicator algebra. In order to illustrate the relationship between the quality indicator algebraic
operations and the high-level user query, the SELECT, FROM, WHERE structure of SQL is extended
with an extra clause "with QUALITY." This extra clause enables a user to specify the quality
requirements regarding the attributes referred to in a query.

If the clause "with QUALITY" is absent in a user query, then the user has no
explicit constraints on the quality of the data being retrieved. In that case, quality indicator values
would not be compared in the retrieval process; however, the quality indicator values associated with
the application data would be retrieved as well.

In the extended SQL syntax, the dot notation is used to identify a quality indicator in the
quality indicator tree. In Figure 9, for example, EE.SRC1.SRC2 identifies SRC2, which is a quality
indicator for SRC1, which in turn is a quality indicator for EE.
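The dot notation can be read as a walk along quality keys, one relation per dot. The following sketch shows one way such a path might be resolved. The function name resolve and the dictionary layout are hypothetical, and the sample rows are taken from Tables 7, 7.1, and 7.2.

```python
NIL = None  # stands for nil¢

# Tuple id -> quality indicator tuple. Each entry maps an indicator name to
# a pair (value, quality key). Because tuple ids are system-wide unique,
# level-one and level-two tuples can share one lookup table in this sketch.
QIR = {
    "id0101": {"SRC1": ("Nexis", "id0201"), "Reporting_date": ("10-07-92", NIL)},
    "id0103": {"SRC1": ("Lotus", "id0203"), "Reporting_date": ("10-07-92", NIL)},
    "id0201": {"SRC2": ("Zacks", NIL), "Reporting_date": ("1-07-92", NIL)},
    "id0203": {"SRC2": ("First Boston", NIL), "Reporting_date": ("1-07-92", NIL)},
}


def resolve(row, path, qir=QIR):
    """Return the value named by a dotted path such as 'EE.SRC1.SRC2'."""
    first, *rest = path.split(".")
    value, key = row[first]
    for name in rest:
        if key is NIL:
            raise LookupError(f"no quality indicators available for {name}")
        value, key = qir[key][name]
    return value


ibm = {"CN": ("IBM", NIL), "CEO": ("J Akers", NIL), "EE": (6.08, "id0101")}
print(resolve(ibm, "EE"))            # 6.08
print(resolve(ibm, "EE.SRC1"))       # Nexis
print(resolve(ibm, "EE.SRC1.SRC2"))  # Zacks
```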

The  quality  indicator  algebra  is  presented  in  the  following  subsection. 

4.3.2.     Quality  Indicator  Algebra 

Following  the  relational  algebra  (Klug,  1982),  we  define  the  five  orthogonal  quality  relational 
algebraic  operations,  namely  selection,  projection,  union,  difference,  and  Cartesian  product. 

In the following operations, let QR and QS be two quality schemes and let qr and qs be two
quality relations associated with QR and QS respectively. Let a and b be two attributes in both QR and
QS. Let $t_1$ and $t_2$ be two quality tuples. Let $S_a$ be a set of quality indicators specified by the user for the
attribute a. (That is, $S_a$ is constructed from the specifications given by the user in the "with
QUALITY" clause.) Let the term $t_1.a = t_2.a$ denote that the values of the attribute a in the tuples $t_1$ and
$t_2$ are identical. Let $t_1.a =^{S_a} t_2.a$ denote that the values of attribute a in the tuples $t_1$ and $t_2$ are
QIV-Equal with respect to $S_a$. Similarly, let $t_1.a =^{i} t_2.a$ and $t_1.a =^{m} t_2.a$ denote i-level QIV-Equal and
maximum-level QIV-Equal respectively between the values of $t_1.a$ and $t_2.a$.

4.3.2.1.  Selection

Selection is a unary operation which selects only a horizontal subset of a quality relation (and
its corresponding quality indicator relations) based on the conditions specified in the Selection
operation. There are two types of conditions in the Selection operation: regular conditions for an
application attribute and quality conditions for the quality indicator relations corresponding to the
application attribute. The selection, $\sigma^{q}_{C}(qr)$, is defined as follows:

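$$\sigma^{q}_{C}(qr) = \{\, t \mid \forall t_1 \in qr,\ \forall a \in QR,\ ((t.a = t_1.a) \wedge (t.a =^{m} t_1.a)) \wedge C(t_1) \,\}$$

where $C(t_1) = e_1 \,\Phi\, e_2 \,\Phi \cdots \Phi\, e_n \,\Phi\, e^{q}_1 \,\Phi\, e^{q}_2 \,\Phi \cdots \Phi\, e^{q}_p$; each $e_i$ is in one of the forms
$(t_1.a \;\theta\; \text{constant})$ or $(t_1.a \;\theta\; t_1.b)$; each $e^{q}_i$ is in one of the forms $(qi_k = \text{constant})$,
$(t_1.a =^{S_{a,b}} t_1.b)$, $(t_1.a =^{i} t_1.b)$, or $(t_1.a =^{m} t_1.b)$; $qi_k \in QI(a)$; $\Phi \in \{\wedge, \vee, \neg\}$;
$\theta \in \{\le, \ge, <, >, =, \ne\}$; and $S_{a,b}$ is the set of quality indicators to be compared during the
comparison of $t_1.a$ and $t_1.b$.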
Example 1: Get all Large_and_Medium companies whose earnings estimate is over 2 and is supplied by
Zacks Investment Research.

A corresponding extended SQL query is shown as follows:

    SELECT        CN, CEO, EE
    FROM          LARGE_and_MEDIUM
    WHERE         EE > 2
    with QUALITY  EE.SRC1.SRC2 = 'Zacks'

This  SQL  query  can  be  accomplished  through  a  Selection  operation  in  the  quality  indicator 
algebra.  The  result  is  shown  below. 


Table 9

<CN, nil¢>       <CEO, nil¢>           <EE, EE¢>
<IBM, nil¢>      <J Akers, nil¢>       <6.08, id0101¢>

Table 9.1

<EE¢, nil¢>      <SRC1, SRC1¢>         <Reporting_date, nil¢>
<id0101¢, nil¢>  <Nexis, id0201¢>      <10-07-92, nil¢>

Table 9.2

<SRC1¢, nil¢>    <SRC2, nil¢>          <Reporting_date, nil¢>
<id0201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>


Note that in the conventional relational model, only Table 9 would be produced as a result of
this SQL query, whereas in the quality indicator algebra, Tables 9.1 and 9.2 are also produced. Table 9
shows that the earnings estimate for IBM is 6.08, and the quality indicator values in Tables 9.1 and 9.2
show that the data was retrieved from the Nexis database on October 7, 1992, which, in turn, is based on
data reported by Zacks Investment Research on January 7, 1992. An experienced user could infer from
these quality indicator values that the estimate is credible, given that Zacks is a reliable source of
earnings estimates.
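Example 1 can be replayed on the data of Figure 13. The sketch below is an illustration under stated assumptions rather than the algebra itself: the names select, resolve, and indicator_tuples are hypothetical, conditions are passed as Python functions instead of being parsed from the extended SQL, and reporting dates are omitted for brevity.

```python
NIL = None  # stands for nil¢

LARGE_AND_MEDIUM = [
    {"CN": ("IBM", NIL), "CEO": ("J Akers", NIL), "EE": (6.08, "id0101")},
    {"CN": ("DEC", NIL), "CEO": ("K Olsen", NIL), "EE": (-0.32, "id0102")},
    {"CN": ("TI", NIL), "CEO": ("J Junkins", NIL), "EE": (2.51, "id0103")},
]
# Tuple id -> quality indicator tuple (Tables 7.1 and 7.2, dates omitted).
QIR = {
    "id0101": {"SRC1": ("Nexis", "id0201")}, "id0201": {"SRC2": ("Zacks", NIL)},
    "id0102": {"SRC1": ("Nexis", "id0202")}, "id0202": {"SRC2": ("First Boston", NIL)},
    "id0103": {"SRC1": ("Lotus", "id0203")}, "id0203": {"SRC2": ("First Boston", NIL)},
}


def resolve(row, path, qir):
    """Return the value named by a dotted path such as 'EE.SRC1.SRC2'."""
    first, *rest = path.split(".")
    value, key = row[first]
    for name in rest:
        value, key = qir[key][name]
    return value


def indicator_tuples(row, qir):
    """Collect every quality indicator tuple reachable from a quality tuple."""
    found = {}
    stack = [key for _value, key in row.values() if key is not NIL]
    while stack:
        key = stack.pop()
        found[key] = qir[key]
        stack.extend(k for _value, k in qir[key].values() if k is not NIL)
    return found


def select(relation, qir, condition, quality_condition):
    """Keep tuples satisfying both conditions, together with their indicator tuples."""
    rows = [r for r in relation if condition(r) and quality_condition(r)]
    kept = {}
    for r in rows:
        kept.update(indicator_tuples(r, qir))
    return rows, kept


rows, kept = select(
    LARGE_AND_MEDIUM, QIR,
    condition=lambda r: r["EE"][0] > 2,                                   # WHERE EE > 2
    quality_condition=lambda r: resolve(r, "EE.SRC1.SRC2", QIR) == "Zacks",
)
print([r["CN"][0] for r in rows])  # ['IBM']
print(sorted(kept))                # ['id0101', 'id0201']
```

TI also satisfies EE > 2 but is filtered out by the quality condition, since its level-two source is First Boston.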

4.3.2.2.  Projection 

Projection is a unary operation which selects a vertical subset of a quality relation based on the
set of attributes specified in the Projection operation. The result includes the projected quality relation
and the corresponding quality indicator relations that are associated with the set of attributes
specified in the Projection operation.

Let PJ be the attribute set specified; then the Projection, $\Pi^{q}_{PJ}(qr)$, is defined as follows:

$$\Pi^{q}_{PJ}(qr) = \{\, t \mid \forall t_1 \in qr,\ \forall a \in PJ,\ ((t.a = t_1.a) \wedge (t.a =^{m} t_1.a)) \,\}$$


Example 2: Get company names and earnings estimates of all Large_and_Medium companies.
A corresponding SQL query is shown as follows:

    SELECT  CN, EE
    FROM    LARGE_and_MEDIUM

This SQL query can be accomplished through a Projection operation. The result is shown below.


<CN, nil¢>       <EE, EE¢>
<IBM, nil¢>      <6.08, id0101¢>
<DEC, nil¢>      <-0.32, id0102¢>
<TI, nil¢>       <2.51, id0103¢>

<EE¢, nil¢>      <SRC1, SRC1¢>         <Reporting_date, nil¢>
<id0101¢, nil¢>  <Nexis, id0201¢>      <10-07-92, nil¢>
<id0102¢, nil¢>  <Nexis, id0202¢>      <10-07-92, nil¢>
<id0103¢, nil¢>  <Lotus, id0203¢>      <10-07-92, nil¢>

<SRC1¢, nil¢>    <SRC2, nil¢>          <Reporting_date, nil¢>
<id0201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
<id0202¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>
<id0203¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>

4.3.2.3.  Union 

In Union, the two operand quality relations must be QI-Compatible. The result includes (1)
tuples from both qr and qs after elimination of duplicates, and (2) the corresponding quality indicator
relations that are associated with the resulting tuples.

$$qr \cup^{q} qs = qr \cup \{\, t \mid \forall t_2 \in qs,\ \exists t_1 \in qr,\ \forall a \in QR,\ ((t.a = t_2.a) \wedge (t.a =^{m} t_2.a) \wedge \neg((t_1.a = t_2.a) \wedge (t_1.a =^{S_a} t_2.a))) \,\}$$

In the above expression, $\neg((t_1.a = t_2.a) \wedge (t_1.a =^{S_a} t_2.a))$ is meant to eliminate duplicates. Tuples
$t_1$ and $t_2$ are considered duplicates provided that (1) there is a match between their corresponding
attribute values (i.e., $t_1.a = t_2.a$) and (2) these values are QIV-Equal with respect to the set of quality
indicators ($S_a$) specified by the user (i.e., $t_1.a =^{S_a} t_2.a$).

Example 3-1: Get company names, CEO names, and earnings estimates of all Large_and_Medium and
Small_and_Medium companies.
A corresponding extended SQL query is shown as follows:

    SELECT        LM.CN, LM.CEO, LM.EE
    FROM          LARGE_and_MEDIUM LM
    UNION
    SELECT        SM.CN, SM.CEO, SM.EE
    FROM          SMALL_and_MEDIUM SM
    with QUALITY  (LM.EE.SRC1.SRC2 = SM.EE.SRC1.SRC2)

This SQL query can be accomplished through a Union operation. The result is shown below.
<CN, nil¢>       <CEO, nil¢>           <EE, EE¢>
<IBM, nil¢>      <J Akers, nil¢>       <6.08, id0101¢>
<DEC, nil¢>      <K Olsen, nil¢>       <-0.32, id0102¢>
<TI, nil¢>       <J Junkins, nil¢>     <2.51, id0103¢>
<Apple, nil¢>    <J Sculley, nil¢>     <5.69, id1101¢>
<TI, nil¢>       <J Junkins, nil¢>     <2.51, id1103¢>

<EE¢, nil¢>      <SRC1, SRC1¢>         <Reporting_date, nil¢>
<id0101¢, nil¢>  <Nexis, id0201¢>      <10-07-92, nil¢>
<id0102¢, nil¢>  <Nexis, id0202¢>      <10-07-92, nil¢>
<id0103¢, nil¢>  <Lotus, id0203¢>      <10-07-92, nil¢>
<id1101¢, nil¢>  <Lotus, id1201¢>      <10-07-92, nil¢>
<id1103¢, nil¢>  <Lotus, id1203¢>      <10-07-92, nil¢>

<SRC1¢, nil¢>    <SRC2, nil¢>          <Reporting_date, nil¢>
<id0201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
<id0202¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>
<id0203¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>
<id1201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
<id1203¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>

Note that there are two tuples corresponding to the company TI in the result because their
quality indicator values are different with respect to SRC2.

Example 3-2: If the quality requirement were (LM.EE.SRC1 = SM.EE.SRC1), then these two
tuples would be considered duplicates and only one tuple for TI would be retained in the result. The
result of this query is shown below:


<CN, nil¢>       <CEO, nil¢>           <EE, EE¢>
<IBM, nil¢>      <J Akers, nil¢>       <6.08, id0101¢>
<DEC, nil¢>      <K Olsen, nil¢>       <-0.32, id0102¢>
<TI, nil¢>       <J Junkins, nil¢>     <2.51, id0103¢>
<Apple, nil¢>    <J Sculley, nil¢>     <5.69, id1101¢>

<EE¢, nil¢>      <SRC1, SRC1¢>         <Reporting_date, nil¢>
<id0101¢, nil¢>  <Nexis, id0201¢>      <10-07-92, nil¢>
<id0102¢, nil¢>  <Nexis, id0202¢>      <10-07-92, nil¢>
<id0103¢, nil¢>  <Lotus, id0203¢>      <10-07-92, nil¢>
<id1101¢, nil¢>  <Lotus, id1201¢>      <10-07-92, nil¢>

<SRC1¢, nil¢>    <SRC2, nil¢>          <Reporting_date, nil¢>
<id0201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
<id0202¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>
<id0203¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>
<id1201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
Note also that, unlike the relational union, the quality union operation is not commutative.
This is illustrated in Example 3-3 below.

Example 3-3: Consider the following extended SQL query, which switches the order of the union
operation in Example 3-2:

    SELECT        SM.CN, SM.CEO, SM.EE
    FROM          SMALL_and_MEDIUM SM
    UNION
    SELECT        LM.CN, LM.CEO, LM.EE
    FROM          LARGE_and_MEDIUM LM
    with QUALITY  (LM.EE.SRC1 = SM.EE.SRC1)


The  result  is  shown  below. 


<CN, nil¢>       <CEO, nil¢>           <EE, EE¢>
<IBM, nil¢>      <J Akers, nil¢>       <6.08, id0101¢>
<DEC, nil¢>      <K Olsen, nil¢>       <-0.32, id0102¢>
<Apple, nil¢>    <J Sculley, nil¢>     <5.69, id1101¢>
<TI, nil¢>       <J Junkins, nil¢>     <2.51, id1103¢>

<EE¢, nil¢>      <SRC1, SRC1¢>         <Reporting_date, nil¢>
<id0101¢, nil¢>  <Nexis, id0201¢>      <10-07-92, nil¢>
<id0102¢, nil¢>  <Nexis, id0202¢>      <10-07-92, nil¢>
<id1101¢, nil¢>  <Lotus, id1201¢>      <10-07-92, nil¢>
<id1103¢, nil¢>  <Lotus, id1203¢>      <10-07-92, nil¢>

<SRC1¢, nil¢>    <SRC2, nil¢>          <Reporting_date, nil¢>
<id0201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
<id0202¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>
<id1201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
<id1203¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>

In  the  above  result  the  tuple  corresponding  to  TI  is  taken  from  SMALL_and_MEDIUM 
companies.  On  the  other  hand,  in  Example  3-2  it  is  taken  from  the  LARGE_and_MEDIUM  companies. 
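Examples 3-1 through 3-3 can be replayed with a small sketch of the quality union. It is an illustration under stated assumptions: the names quality_union, duplicates, and resolve are hypothetical, the quality requirement is reduced to a single dotted path compared on both sides, reporting dates are omitted, and the data comes from Figures 13 and 14. The checks below concern only which TI tuple survives. The order of tuples in the output is not meant to match the tables above.

```python
NIL = None  # stands for nil¢


def row(cn, ceo, ee, key):
    return {"CN": (cn, NIL), "CEO": (ceo, NIL), "EE": (ee, key)}


LM = [row("IBM", "J Akers", 6.08, "id0101"),
      row("DEC", "K Olsen", -0.32, "id0102"),
      row("TI", "J Junkins", 2.51, "id0103")]
SM = [row("Apple", "J Sculley", 5.69, "id1101"),
      row("DEC", "K Olsen", -0.32, "id1102"),
      row("TI", "J Junkins", 2.51, "id1103")]

# Tuple id -> quality indicator tuple (Tables 7.1, 7.2, 8.1, 8.2; dates omitted).
QIR = {
    "id0101": {"SRC1": ("Nexis", "id0201")}, "id0201": {"SRC2": ("Zacks", NIL)},
    "id0102": {"SRC1": ("Nexis", "id0202")}, "id0202": {"SRC2": ("First Boston", NIL)},
    "id0103": {"SRC1": ("Lotus", "id0203")}, "id0203": {"SRC2": ("First Boston", NIL)},
    "id1101": {"SRC1": ("Lotus", "id1201")}, "id1201": {"SRC2": ("Zacks", NIL)},
    "id1102": {"SRC1": ("Nexis", "id1202")}, "id1202": {"SRC2": ("First Boston", NIL)},
    "id1103": {"SRC1": ("Lotus", "id1203")}, "id1203": {"SRC2": ("Zacks", NIL)},
}


def resolve(t, path):
    """Return the value named by a dotted path such as 'EE.SRC1.SRC2'."""
    first, *rest = path.split(".")
    value, key = t[first]
    for name in rest:
        value, key = QIR[key][name]
    return value


def duplicates(t1, t2, path):
    """Same attribute values and same value for the requested quality indicator."""
    same_values = all(t1[a][0] == t2[a][0] for a in t1)
    return same_values and resolve(t1, path) == resolve(t2, path)


def quality_union(qr, qs, path):
    """All of qr, plus the tuples of qs that have no duplicate in qr."""
    return qr + [t2 for t2 in qs if not any(duplicates(t1, t2, path) for t1 in qr)]


def show(result):
    return [(t["CN"][0], t["EE"][1]) for t in result]


print(show(quality_union(LM, SM, "EE.SRC1.SRC2")))  # Example 3-1: both TI tuples kept
print(show(quality_union(LM, SM, "EE.SRC1")))       # Example 3-2: TI from LM (id0103)
print(show(quality_union(SM, LM, "EE.SRC1")))       # Example 3-3: TI from SM (id1103)
```

Because the left operand is always kept whole, swapping the operands changes which of two duplicate tuples survives. That is the non-commutativity noted in Example 3-3.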

4.3.2.4.  Difference

In Difference, the two operand quality relations must be QI-Compatible. The result of this
operation consists of all tuples from qr which are not equal to tuples in qs. During this equality test the
quality of attributes specified by the user for each attribute value in the tuples $t_1$ and $t_2$ will also be
taken into consideration.

$$qr -^{q} qs = \{\, t \mid \forall t_1 \in qr,\ \exists t_2 \in qs,\ \forall a \in QR,\ ((t.a = t_1.a) \wedge (t.a =^{m} t_1.a) \wedge \neg((t_1.a = t_2.a) \wedge (t_1.a =^{S_a} t_2.a))) \,\}$$
Example 4: Get all the companies which are classified as only Large_and_Medium companies but not
as Small_and_Medium companies.
A corresponding SQL query is shown as follows:

    SELECT        LM.CN, LM.CEO, LM.EE
    FROM          LARGE_and_MEDIUM LM
    DIFFERENCE
    SELECT        SM.CN, SM.CEO, SM.EE
    FROM          SMALL_and_MEDIUM SM
    with QUALITY  (LM.EE.SRC1.SRC2 = SM.EE.SRC1.SRC2)

This SQL query can be accomplished through a Difference operation. The result is shown
below.


<CN, nil¢>       <CEO, nil¢>           <EE, EE¢>
<IBM, nil¢>      <J Akers, nil¢>       <6.08, id0101¢>
<TI, nil¢>       <J Junkins, nil¢>     <2.51, id0103¢>

<EE¢, nil¢>      <SRC1, SRC1¢>         <Reporting_date, nil¢>
<id0101¢, nil¢>  <Nexis, id0201¢>      <10-07-92, nil¢>
<id0103¢, nil¢>  <Lotus, id0203¢>      <10-07-92, nil¢>

<SRC1¢, nil¢>    <SRC2, nil¢>          <Reporting_date, nil¢>
<id0201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
<id0203¢, nil¢>  <First Boston, nil¢>  <1-07-92, nil¢>


Note here that according to the conventional relational algebra, the tuple corresponding to the
company TI must not be included in the result. But in the quality indicator algebra, the tuple corresponding
to the company TI from the relation Large_and_Medium is included in the result because the
corresponding tuple in the relation Small_and_Medium has different quality indicators than those of
the relation Large_and_Medium. In the following paragraph, an example is provided to demonstrate
the change in the contents of results when quality requirements change.

If the constraint in the QUALITY part of the query were (LM.EE.SRC1 = SM.EE.SRC1), then the
result is as follows:


<CN, nil¢>       <CEO, nil¢>           <EE, EE¢>
<IBM, nil¢>      <J Akers, nil¢>       <6.08, id0101¢>

<EE¢, nil¢>      <SRC1, SRC1¢>         <Reporting_date, nil¢>
<id0101¢, nil¢>  <Nexis, id0201¢>      <10-07-92, nil¢>

<SRC1¢, nil¢>    <SRC2, nil¢>          <Reporting_date, nil¢>
<id0201¢, nil¢>  <Zacks, nil¢>         <1-07-92, nil¢>
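The two Difference results can be reproduced with a sketch analogous to the duplicate test used for Union. It is an illustration under stated assumptions: the names quality_difference, duplicates, and resolve are hypothetical, the quality requirement is reduced to one dotted path, reporting dates are omitted, and the data comes from Figures 13 and 14.

```python
NIL = None  # stands for nil¢


def row(cn, ceo, ee, key):
    return {"CN": (cn, NIL), "CEO": (ceo, NIL), "EE": (ee, key)}


LM = [row("IBM", "J Akers", 6.08, "id0101"),
      row("DEC", "K Olsen", -0.32, "id0102"),
      row("TI", "J Junkins", 2.51, "id0103")]
SM = [row("Apple", "J Sculley", 5.69, "id1101"),
      row("DEC", "K Olsen", -0.32, "id1102"),
      row("TI", "J Junkins", 2.51, "id1103")]

# Tuple id -> quality indicator tuple (Tables 7.1, 7.2, 8.1, 8.2; dates omitted).
QIR = {
    "id0101": {"SRC1": ("Nexis", "id0201")}, "id0201": {"SRC2": ("Zacks", NIL)},
    "id0102": {"SRC1": ("Nexis", "id0202")}, "id0202": {"SRC2": ("First Boston", NIL)},
    "id0103": {"SRC1": ("Lotus", "id0203")}, "id0203": {"SRC2": ("First Boston", NIL)},
    "id1101": {"SRC1": ("Lotus", "id1201")}, "id1201": {"SRC2": ("Zacks", NIL)},
    "id1102": {"SRC1": ("Nexis", "id1202")}, "id1202": {"SRC2": ("First Boston", NIL)},
    "id1103": {"SRC1": ("Lotus", "id1203")}, "id1203": {"SRC2": ("Zacks", NIL)},
}


def resolve(t, path):
    """Return the value named by a dotted path such as 'EE.SRC1.SRC2'."""
    first, *rest = path.split(".")
    value, key = t[first]
    for name in rest:
        value, key = QIR[key][name]
    return value


def duplicates(t1, t2, path):
    """Same attribute values and same value for the requested quality indicator."""
    same_values = all(t1[a][0] == t2[a][0] for a in t1)
    return same_values and resolve(t1, path) == resolve(t2, path)


def quality_difference(qr, qs, path):
    """Tuples of qr that have no duplicate in qs under the quality requirement."""
    return [t1 for t1 in qr if not any(duplicates(t1, t2, path) for t2 in qs)]


print([t["CN"][0] for t in quality_difference(LM, SM, "EE.SRC1.SRC2")])  # ['IBM', 'TI']
print([t["CN"][0] for t in quality_difference(LM, SM, "EE.SRC1")])       # ['IBM']
```

With the level-two requirement, TI stays in the result because First Boston differs from Zacks. With the level-one requirement, both TI tuples report Lotus, so TI is removed.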
4.3.2.5.  Cartesian Product

The Cartesian product is also a binary operation. Let QR be of degree r and QS be of degree s.
Let $t_1 \in qr$ and $t_2 \in qs$. Let $t_1(i)$ denote the $i$-th attribute of the tuple $t_1$ and $t_2(i)$ denote the $i$-th attribute of
the tuple $t_2$. The tuple t in the quality relation resulting from the Cartesian product of qr and qs will be
of degree r+s. The Cartesian product of qr and qs, denoted as $qr \times^{q} qs$, is defined as follows:

$$
\begin{aligned}
qr \times^{q} qs = \{\, t \mid\ & \forall t_1 \in qr,\ \forall t_2 \in qs, \\
& t(1) = t_1(1) \wedge t(1) =^{m} t_1(1) \wedge t(2) = t_1(2) \wedge t(2) =^{m} t_1(2) \wedge \cdots \wedge t(r) = t_1(r) \wedge t(r) =^{m} t_1(r) \wedge {} \\
& t(r+1) = t_2(1) \wedge t(r+1) =^{m} t_2(1) \wedge t(r+2) = t_2(2) \wedge t(r+2) =^{m} t_2(2) \wedge \cdots \wedge t(r+s) = t_2(s) \wedge t(r+s) =^{m} t_2(s) \,\}
\end{aligned}
$$

The  result  of  the  Cartesian  product  between  Large_and_Medium  and  Small_and_Medium  is 
shown  below. 
The set of quality indicator tables associated with each attribute in the table resulting from
the Cartesian product is retrieved as part of the result.

Other algebraic operators such as Intersection and Join can be derived from these five
orthogonal operators, as in the relational algebra.

We have presented the attribute-based model, including a description of the model structure, a
set of integrity constraints for the model, and a quality indicator algebra. In addition, each of the
algebraic operations is exemplified in the context of the SQL query. The next section discusses some of
the capabilities of this model and future research directions.

5.        Discussion and future directions

The  attribute-based  model  can  be  applied  in  many  different  ways  and  some  of  them  are  listed 
below: 

•  The  ability  of  the  model  to  support  quality  indicators  at  multiple  levels  makes  it  possible  to 
retain  the  origin  and  intermediate  data  sources.  The  example  in  Figure  9  illustrates  this. 

•  A user can filter the data retrieved from a database according to quality requirements. In
Example  1,  for  instance,  only  the  data  furnished  by  Zacks  Investment  Research  is  retrieved  as 
specified  in  the  clause  "with  QUALITY   EE.SRCl.SRC2='Zacks'." 

•  Data  authenticity  and  believability  can  be  improved  by  data  inspection  and  certification.  A 
quality  indicator  value  could  indicate  who  inspected  or  certified  the  data  and  when  it  was 
inspected.   The  reputation  of  the  inspector  will  enhance  the  believability  of  the  data. 

•  The  quality  indicators  associated  with  data  can  help  clarify  data  semantics,  which  can  be  used 
to  resolve  semantic  incompatibility  among  data  items  received  from  different  sources.  This 
capability  is  very  useful  in  an  interoperable  environment  where  data  in  different  databases 
have  different  semantics. 

•  Quality indicators associated with an attribute may facilitate a better interpretation of null
values. For example, if the value retrieved for the spouse field is empty in an employee record,
it can be interpreted (i.e., tagged) in several ways, such as (1) the employee is unmarried, (2)
the spouse name is unknown, or (3) this tuple is inserted into the employee table from the
materialization of a view over a table which does not have a spouse field.

•  In a data quality control process, when errors are detected, the data administrator can identify
the  source  of  error  by  examining  quality  indicators  such  as  data  source  or  collection  method. 

In this paper, we have investigated how quality indicators may be specified, stored, retrieved,
and processed. Specifically, we have (1) established a step-by-step procedure for data quality
requirements analysis and specification, (2) presented a model for the structure, storage, and processing
of quality relations and quality indicator relations (through the algebra), and (3) touched upon
functionalities related to data quality administration and control.

We are actively pursuing research in the following areas: (1) We are investigating mechanisms
to determine the quality of derived data (e.g., data that combines accurate monthly data with less
accurate weekly data) based on the quality indicator values of its components. (2) In order to use
this model for existing databases, which do not have tagging capability, they must be extended with
quality schemas instantiated with appropriate quality indicator values. We are exploring the
possibility of making such a transformation cost-effective. (3) Though we have chosen the relational
model to represent the quality schema, an object-oriented approach appears natural to model data and
its quality indicators. Because many of the quality control mechanisms are procedure oriented and
object-oriented models can handle procedures (i.e., methods), we are investigating the pros and cons
of the object-oriented approach.
7.        Appendix A: Premises about data quality requirements analysis

Below we present premises related to data quality modeling and data quality requirements
analysis. To facilitate further discussion, we define a data quality attribute as a collective term that
refers to both quality parameters and quality indicators as shown in Figure A.1. (This term is referred
to as a quality attribute hereafter.)


Figure A.1: Relationship among quality attributes, quality parameters, and quality indicators.

7.1.        Premises  related  to  data  quality  modeling 

Data  quality  modeling  is  an  extension  of  traditional  data  modeling  methodologies.  As  data 
modeling  captures  many  of  the  structural  and  semantic  issues  underlying  data,  data  quality  modeling 
captures  many  of  the  structural  and  semantic  issues  underlying  data  quality.  The  following  four 
premises  relate  to  these  data  quality  modeling  issues. 

(Premise 1.1) (Relatedness between entity and quality attributes): In some cases a quality
attribute can be considered either as an entity attribute (i.e., an application entity's attribute) or as a
quality attribute. For example, the name of a teller who performs a transaction in a banking
application may be an entity attribute if initial application requirements state that the teller's name
be included; alternatively, it may be modeled as a quality attribute.

From a modeling perspective, whether an attribute should be modeled as an entity attribute or
a quality attribute is a judgment call on the part of the design team, and may depend on the initial
application requirements as well as eventual uses of the data, such as the inspection of the data for
distribution to external users, or for integration with other data of different quality. Distribution
and integration are relevant because the users of a given system often "know" the
quality of the data they use. When the data is exported to other users, however, or combined with
information of different quality, that quality may become unknown.

A guideline for this judgment is to ask what information the attribute provides. If the attribute
provides application information such as a customer name and address, it may be considered an entity
attribute. If, on the other hand, the information relates more to aspects of the data manufacturing
process, such as when, where, and by whom the data was manufactured, then this may be a quality
attribute.

In short, the objective of the data quality requirements analysis is not strictly to develop quality
attributes, but also to ensure that important dimensions of data quality are not overlooked entirely in
requirements analysis.
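
The guideline in Premise 1.1 can be illustrated with a small sketch. It is not part of the attribute-based model itself; the class name TaggedCell, the field names, and the sample values are hypothetical and chosen only to show the same teller name modeled once as an entity attribute and once as a quality indicator attached to a cell.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class TaggedCell:
    """A data value together with quality indicators about how it was manufactured.

    Hypothetical structure for illustration only.
    """
    value: Any
    indicators: Dict[str, Any] = field(default_factory=dict)


# Design A: the application requirements ask for the teller's name,
# so it is modeled as an ordinary entity attribute of the transaction.
transaction_a = {
    "amount": TaggedCell(250.00),
    "teller_name": TaggedCell("J. Smith"),
}

# Design B: the teller's name only documents by whom (and where) the amount
# was entered, so it is modeled as a quality indicator on the amount cell.
transaction_b = {
    "amount": TaggedCell(
        250.00,
        indicators={"entered_by": "J. Smith", "entered_at": "branch 12"},
    ),
}
```

In Design B the teller name travels with the amount it describes, so it remains available if the amount is exported or integrated with other data.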

(Premise 1.2) (Quality attribute non-orthogonality): Different quality attributes need not be
orthogonal to one another. For example, the two quality parameters credibility and timeliness are
related (i.e., not orthogonal), such as for real time data.

(Premise 1.3) (Heterogeneity and hierarchy in the quality of supplied data): Quality of data
may  differ  across  databases,  entities,  attributes,  and  instances.  Database  example:  information  in  a 
university  database  may  be  of  higher  quality  than  data  in  John  Doe's  personal  database.  Entity 
example:  data  about  alumni  (an  entity)  may  be  less  reliable  than  data  about  students  (an  entity). 
Attribute  example:  in  the  student  entity,  grades  may  be  more  accurate  than  are  addresses.  Instance 
example:  data  about  an  international  student  may  be  less  interpretable  than  that  of  a  domestic  student. 
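
One possible way to record quality that differs across these four levels is sketched below. It rests on assumptions that are not in the premise: a single hypothetical rating named reliability, the sample scopes and values, and the rule that the most specific recorded level wins.

```python
from typing import Dict, Optional, Tuple

# A scope is (database, entity, attribute, instance).
# None means "applies to everything below this level".
Scope = Tuple[str, Optional[str], Optional[str], Optional[str]]

# Hypothetical ratings, for illustration only.
reliability: Dict[Scope, str] = {
    ("university_db", None, None, None): "high",               # database level
    ("university_db", "alumni", None, None): "medium",         # entity level
    ("university_db", "student", "address", None): "medium",   # attribute level
    ("university_db", "student", "address", "s042"): "low",    # instance level
}


def lookup(database: str, entity: str, attribute: str, instance: str) -> Optional[str]:
    """Return the most specific rating recorded for a cell, or None if there is none."""
    candidates = [
        (database, entity, attribute, instance),
        (database, entity, attribute, None),
        (database, entity, None, None),
        (database, None, None, None),
    ]
    for scope in candidates:
        if scope in reliability:
            return reliability[scope]
    return None
```

With these sample entries, a student's grade inherits the database-level rating, while the address of student s042 carries its own instance-level rating.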

7.2.  Premises  related  to  data  quality  definitions  and  standards  across  users 

Because  human  insight  is  needed  for  data  quality  modeling  and  different  people  may  have 
different  opinions  regarding  data  quality,  different  quality  definitions  and  standards  may  result.  We 
call this phenomenon "data quality is in the eye of the beholder." The following two premises entail
this phenomenon.

(Premise 2.1) (Users define different quality attributes): Quality parameters and quality
indicators may vary from one user to another. Quality parameter example: for a manager the quality
parameter for a research report may be inexpensive, whereas for a financial trader, the research report
may need to be credible and timely. Quality indicator example: the manager may measure
inexpensiveness in terms of the quality indicator (monetary) cost, whereas the trader may measure
inexpensiveness in terms of opportunity cost of her own time and thus the quality indicator may be
retrieval time.

(Premise  2.2)  (Users  have  different  quality  standards):  Acceptable  levels  of  data  quality  may 
differ  from  one  user  to  another.  For  example,  an  investor  following  the  movement  of  a  stock  may 
consider a fifteen minute delay for share price to be sufficiently timely, whereas a trader who needs
price  quotes  in  real  time  may  not  consider  fifteen  minutes  to  be  timely  enough. 
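
Premise 2.2 can be made concrete with a short sketch in which the same quality indicator value is judged against two user-specific standards. The names Quote, delay_minutes, and is_timely are hypothetical, and the trader's threshold of 0.1 minutes is an assumed stand-in for "real time."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    symbol: str
    price: float
    delay_minutes: float  # quality indicator: age of the quote on delivery


def is_timely(quote: Quote, max_delay_minutes: float) -> bool:
    """Apply one user's timeliness standard to the quote's delay indicator."""
    return quote.delay_minutes <= max_delay_minutes


# The same tagged data item is shown to both users.
quote = Quote(symbol="XYZ", price=101.25, delay_minutes=15.0)

INVESTOR_MAX_DELAY = 15.0  # a fifteen minute delay is acceptable
TRADER_MAX_DELAY = 0.1     # assumed threshold standing in for "real time"

investor_accepts = is_timely(quote, INVESTOR_MAX_DELAY)
trader_accepts = is_timely(quote, TRADER_MAX_DELAY)
```

The data and its indicator are identical for both users; only the standard applied to the indicator differs.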

7.3.  Premises  related  to  a  single  user 

A  single  user  may  have  different  quality  attributes  and  quality  standards  for  the  different 
data  used.  This  phenomenon  is  summarized  in  Premise  3  below. 

(Premise 3) (For a single user; non-uniform data quality attributes and standards): A user may
have different quality attributes and quality standards across databases, entities, attributes, or
instances. Across attributes example: A user may need higher quality information for the phone number
than for the number of employees. Across instances example: A user may need high quality information
for certain companies, but not for others due to the fact that some companies are of particular interest